Tutorial Outline

Introduction

Twitter provides two types of APIs to access its data:

  • REST API: used to query existing data objects such as statuses ("tweets"), users, etc.
  • Streaming API: used to receive live statuses ("tweets") as they are posted

Reasons you might want to use the Streaming API:

  • Capturing large amounts of data, since the REST API has limited access to older tweets
  • Real-time analysis, such as monitoring social discussion about a live event
  • In-house archiving, such as archiving social discussion about your brand(s)
  • Automated response systems for a Twitter account, such as auto-replying and filing questions or providing answers

Prerequisites

  • Python 2 or 3
  • Jupyter with ipywidgets
  • Pandas
  • Numpy
  • Matplotlib
  • MongoDB installation
  • Pymongo
  • Scikit-learn
  • Tweepy
  • Twitter account

How does it work?

The Twitter Streaming API provides data through a streaming HTTP response. This is very similar to downloading a file, where you read a number of bytes, store them to disk, and repeat until the end of the file. The only difference is that this stream is endless. Only two things can stop it:

  • You close your connection to the streaming response
  • Your connection cannot keep up with the incoming data and the server's buffer fills up

This means the stream occupies the thread it was launched from until it is stopped. In production, you should always run it in a separate thread or process so that your software doesn't freeze while the stream is active.
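The advice above can be sketched as follows. This is a hypothetical helper (`start_stream_in_background` is not part of tweepy); it assumes `stream` behaves like the `tweepy.Stream` instance created later in this tutorial:

```python
import threading

def start_stream_in_background(stream, keywords):
    # Run the blocking stream call in a background thread so the main
    # program stays responsive
    worker = threading.Thread(
        target=stream.filter,
        kwargs={"track": keywords},
    )
    worker.daemon = True  # don't keep the interpreter alive on exit
    worker.start()
    return worker  # call stream.disconnect() to stop, then worker.join()
```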

Authentication

You will need four keys from the Twitter developer site to start using the Streaming API. First, let's import some libraries for dealing with the Twitter API, data analysis, data storage, etc.


In [1]:
import numpy as np
import pandas as pd
import tweepy
import matplotlib.pyplot as plt
import pymongo
import ipywidgets as wgt
from IPython.display import display
from sklearn.feature_extraction.text import CountVectorizer
import re
from datetime import datetime

%matplotlib inline

Authentication keys

  1. Go to https://apps.twitter.com/
  2. Create an App (if you don't have one yet)
  3. Grant read-only access to your account
  4. Copy the four keys and paste them here:

In [2]:
api_key = "YOUR_API_KEY" # <---- Add your API Key
api_secret = "YOUR_API_SECRET" # <---- Add your API Secret
access_token = "YOUR_ACCESS_TOKEN" # <---- Add your access token
access_token_secret = "YOUR_ACCESS_TOKEN_SECRET" # <---- Add your access token secret

auth = tweepy.OAuthHandler(api_key, api_secret)
auth.set_access_token(access_token, access_token_secret)

api = tweepy.API(auth)

MongoDB Collection

Connect to MongoDB and create/get a collection.


In [3]:
col = pymongo.MongoClient()["tweets"]["StreamingTutorial"]
col.count()


Out[3]:
2251

Starting a Stream

We need a listener class that extends tweepy.StreamListener. There are a number of methods you can override to give the listener its behavior. Some of the important ones are:

  • on_status(self, status): called with a status ("tweet") object when a tweet is received
  • on_data(self, raw_data): called whenever any data is received; the raw data is passed in
  • on_error(self, status_code): called when the response code is anything other than 200 (OK)

Stream Listener


In [4]:
class MyStreamListener(tweepy.StreamListener):

    def __init__(self, max_tweets=1000, *args, **kwargs):
        self.max_tweets = max_tweets
        self.counter = 0
        super().__init__(*args, **kwargs)

    def on_connect(self):
        self.counter = 0
        self.start_time = datetime.now()

    def on_status(self, status):
        # Increment counter
        self.counter += 1

        # Store tweet to MongoDB
        col.insert_one(status._json)

        # Update the progress widgets on every tweet
        value = int(100.0 * self.counter / self.max_tweets)
        mining_time = datetime.now() - self.start_time
        elapsed = max(1, mining_time.seconds)  # avoid division by zero
        progress_bar.value = value
        html_value = """<span class="label label-primary">Tweets/Sec: %.1f</span>""" % (self.counter / elapsed)
        html_value += """ <span class="label label-success">Progress: %.1f%%</span>""" % (self.counter / self.max_tweets * 100.0)
        html_value += """ <span class="label label-info">ETA: %.1f Sec</span>""" % ((self.max_tweets - self.counter) / (self.counter / elapsed))
        wgt_status.value = html_value

        # Stop once we have collected enough tweets
        if self.counter >= self.max_tweets:
            myStream.disconnect()
            print("Finished")
            print("Total Mining Time: %s" % mining_time)
            print("Tweets/Sec: %.1f" % (self.max_tweets / elapsed))
            progress_bar.value = 0
    
myStreamListener = MyStreamListener(max_tweets=100)
myStream = tweepy.Stream(auth = api.auth, listener=myStreamListener)

Connect to a streaming API

There are two methods to connect to a stream:

  • filter(follow=None, track=None, async=False, locations=None, stall_warnings=False, languages=None, encoding='utf8', filter_level=None)
  • firehose(count=None, async=False)

Firehose captures everything. You should make sure your connection speed can handle the stream and that you have the storage capacity to store tweets at the same rate. We can't really use firehose for this tutorial, so we'll be using filter.

You have to specify one of two things to filter:

  • follow: a list of user IDs to follow. This streams all their tweets, retweets, and retweets of their tweets by others. It does not include mentions or manual retweets where the user doesn't press the retweet button.
  • track: a string or a list of strings used for filtering. Words separated by spaces within one phrase are combined with AND. Phrases separated by commas in a string, or passed as separate list items, are combined with OR.

Note: track is case insensitive.
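To make the AND/OR semantics concrete, here is a rough, simplified emulation of the matching rule (an illustration only, not the real API, which also matches hashtags, URLs, and mentions):

```python
def matches_track(tweet_text, track_terms):
    # Terms in the list are OR'ed; words inside a single term
    # (separated by spaces) are AND'ed. Case-insensitive.
    text = tweet_text.lower()
    return any(all(word in text for word in term.lower().split())
               for term in track_terms)

matches_track("I love Data Mining", ["data mining"])    # True: both words present
matches_track("mining rigs for sale", ["data mining"])  # False: "data" missing
matches_track("Jupyter rocks", ["python", "jupyter"])   # True: OR across terms
```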

What to track?

I want to collect all tweets that contain any of these words:

  • Jupyter
  • Python
  • Data Mining
  • Machine Learning
  • Data Science
  • Big Data
  • IoT
  • #R

This could be done with a string or a list. It is easier to do it with a list, to keep your code clear and readable.


In [5]:
keywords = ["Jupyter",
            "Python",
            "Data Mining",
            "Machine Learning",
            "Data Science",
            "Big Data",
            "DataMining",
            "MachineLearning",
            "DataScience",
            "BigData",
            "IoT",
            "#R",
           ]

# Visualize a progress bar to track progress
progress_bar = wgt.IntProgress(value=0)
display(progress_bar)
wgt_status = wgt.HTML(value="""<span class="label label-primary">Tweets/Sec: 0.0</span>""")
display(wgt_status)

# Start a filter with up to 20 retries on error
for error_counter in range(20):
    try:
        myStream.filter(track=keywords)
        print("Tweets collected: %s" % myStream.listener.counter)
        print("Total tweets in collection: %s" % col.count())
        break
    except Exception:
        print("ERROR# %s" % (error_counter + 1))


Finished
Total Mining Time: 0:01:21.477351
Tweets/Sec: 1.2
Tweets collected: 100
Total tweets in collection: 2351

Data Access and Analysis

Now that we have all these tweets stored in a MongoDB collection, let's take a look at one of them.


In [6]:
col.find_one()


Out[6]:
{'_id': ObjectId('56937d2e105f1970314720e2'),
 'contributors': None,
 'coordinates': None,
 'created_at': 'Mon Jan 11 10:00:14 +0000 2016',
 'entities': {'hashtags': [{'indices': [22, 27], 'text': 'Rの法則'}],
  'symbols': [],
  'urls': [],
  'user_mentions': []},
 'favorite_count': 0,
 'favorited': False,
 'filter_level': 'low',
 'geo': None,
 'id': 686487772970942466,
 'id_str': '686487772970942466',
 'in_reply_to_screen_name': None,
 'in_reply_to_status_id': None,
 'in_reply_to_status_id_str': None,
 'in_reply_to_user_id': None,
 'in_reply_to_user_id_str': None,
 'is_quote_status': False,
 'lang': 'ja',
 'place': None,
 'retweet_count': 0,
 'retweeted': False,
 'source': '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>',
 'text': '体力落ちてきておばさんみたいになってきた。\n#Rの法則',
 'timestamp_ms': '1452506414059',
 'truncated': False,
 'user': {'contributors_enabled': False,
  'created_at': 'Tue Aug 18 16:19:16 +0000 2015',
  'default_profile': True,
  'default_profile_image': False,
  'description': '☮ 関ジャニ∞ & 山田涼介 & Justin Bieber & Benjamin Lasnier & Selena Gomez ☮',
  'favourites_count': 1121,
  'follow_request_sent': None,
  'followers_count': 121,
  'following': None,
  'friends_count': 92,
  'geo_enabled': True,
  'id': 3318871652,
  'id_str': '3318871652',
  'is_translator': False,
  'lang': 'en',
  'listed_count': 0,
  'location': 'The land of dreams',
  'name': 'rena',
  'notifications': None,
  'profile_background_color': 'C0DEED',
  'profile_background_image_url': 'http://abs.twimg.com/images/themes/theme1/bg.png',
  'profile_background_image_url_https': 'https://abs.twimg.com/images/themes/theme1/bg.png',
  'profile_background_tile': False,
  'profile_banner_url': 'https://pbs.twimg.com/profile_banners/3318871652/1452436374',
  'profile_image_url': 'http://pbs.twimg.com/profile_images/683964013558931456/Q1rx1s5b_normal.jpg',
  'profile_image_url_https': 'https://pbs.twimg.com/profile_images/683964013558931456/Q1rx1s5b_normal.jpg',
  'profile_link_color': '0084B4',
  'profile_sidebar_border_color': 'C0DEED',
  'profile_sidebar_fill_color': 'DDEEF6',
  'profile_text_color': '333333',
  'profile_use_background_image': True,
  'protected': False,
  'screen_name': 'Q2HpiJwCX1huBwf',
  'statuses_count': 497,
  'time_zone': None,
  'url': None,
  'utc_offset': None,
  'verified': False}}
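One thing to note from the document above: "created_at" is stored as a fixed-format string, not a date. Under Python 3 it can be parsed like this (the format string is deduced from the sample document; `%z` requires Python 3):

```python
from datetime import datetime

# Parse the "created_at" string from the sample tweet above
created = datetime.strptime("Mon Jan 11 10:00:14 +0000 2016",
                            "%a %b %d %H:%M:%S %z %Y")
```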

Load results to a DataFrame


In [29]:
dataset = [{"created_at": item["created_at"],
            "text": item["text"],
            "user": "@%s" % item["user"]["screen_name"],
            "source": item["source"],
           } for item in col.find()]

dataset = pd.DataFrame(dataset)
dataset


Out[29]:
created_at source text user
0 Mon Jan 11 10:00:14 +0000 2016 <a href="http://twitter.com/download/iphone" r... 体力落ちてきておばさんみたいになってきた。\n#Rの法則 @Q2HpiJwCX1huBwf
1 Mon Jan 11 10:09:26 +0000 2016 <a href="http://twitter.com/download/android" ... 皆におばさんと言われてうれしがってる #Rの法則 @Tamutamu1017
2 Mon Jan 11 10:00:10 +0000 2016 <a href="http://trendkeyword.blog.jp/" rel="no... 【R.I.P】急上昇ワード「R.I.Pã€ã®ã¾ã¨ã‚é€Ÿå ± https://t.co/yi1yfC... @pickword_matome
3 Mon Jan 11 10:00:10 +0000 2016 <a href="http://twitter.com" rel="nofollow">Tw... #Rの法則 \nどれもおばさん臭いけれどやっぱり黄色が一番だなぁ @kakinotise
4 Mon Jan 11 10:00:10 +0000 2016 <a href="http://bufferapp.com" rel="nofollow">... The New Best Thing HP ATP - Vertica Big Data S... @DataCentreNews1
5 Mon Jan 11 10:00:11 +0000 2016 <a href="http://dlvr.it" rel="nofollow">dlvr.i... IoT Now: l’Internet of Things è qui, ora https... @datamanager_it
6 Mon Jan 11 10:00:11 +0000 2016 <a href="http://trendkeyword.doorblog.jp/" rel... 今話題の「R.I.P」まとめ https://t.co/VOc5cwK5hg #R.I.P ... @buzz_wadai
7 Mon Jan 11 10:00:11 +0000 2016 <a href="http://twitterfeed.com" rel="nofollow... #oldham #stockport VIDEO: Snake thief hides py... @Labour_is_PIE
8 Mon Jan 11 10:00:11 +0000 2016 <a href="https://about.twitter.com/products/tw... Las #startup pioneras de #machinelearning ofre... @techreview_es
9 Mon Jan 11 10:00:12 +0000 2016 <a href="http://www.linkedin.com/" rel="nofoll... Lets talk about how to harness the power of ma... @jansmit1
10 Mon Jan 11 10:00:13 +0000 2016 <a href="http://catalystfive.com" rel="nofollo... Business Intelligence and Big Data Consulting ... @Catalyst5Jobs
11 Mon Jan 11 10:00:13 +0000 2016 <a href="http://twitter.com/NewsICT" rel="nofo... [æƒ…å ±é€šä¿¡]2016年台北国際コンピューター見本市が新しい位置づけと新しい展示で装い新たに!... @NewsICT
12 Mon Jan 11 10:02:10 +0000 2016 <a href="http://dlvr.it" rel="nofollow">dlvr.i... #bonplan Parties de Laser Quest entre amis à 2... @Bons_Plans_
13 Mon Jan 11 10:02:10 +0000 2016 <a href="http://dlvr.it" rel="nofollow">dlvr.i... Parties de Laser Quest entre amis à 22.00€ au ... @keepmymindfree
14 Mon Jan 11 10:02:10 +0000 2016 <a href="http://twitter.com" rel="nofollow">Tw... RT @jose_garde: Why a Simple Data Analytics St... @martingeldish
15 Mon Jan 11 10:02:11 +0000 2016 <a href="http://twitter.com/download/iphone" r... 芸能人の人たくさん手たたき笑いしてるからおばさんたくさんになっちゃうよwww\n\n#Rの法則 @YK__0704
16 Mon Jan 11 10:02:12 +0000 2016 <a href="http://dlvr.it" rel="nofollow">dlvr.i... Découvrez le jeu Pure Mission entre amis à 22.... @keepmymindfree
17 Mon Jan 11 10:02:12 +0000 2016 <a href="http://dlvr.it" rel="nofollow">dlvr.i... #bonplan Parties de bowling pour 4 Ã #POINCY :... @Bons_Plans_
18 Mon Jan 11 10:02:12 +0000 2016 <a href="http://dlvr.it" rel="nofollow">dlvr.i... 20 min de vol découverte ULM pour 1 ou 2 à 79.... @CrationSiteWeb
19 Mon Jan 11 10:02:12 +0000 2016 <a href="http://twitter.com" rel="nofollow">Tw... RT @hortonworks: Paris is the city of love but... @bigdataparis
20 Mon Jan 11 10:02:12 +0000 2016 <a href="http://twitter.com" rel="nofollow">Tw... RT @hynek: So #emacs / @spacemacs nerds: is th... @fdiesch
21 Mon Jan 11 10:02:12 +0000 2016 <a href="http://www.hootsuite.com" rel="nofoll... .@QonexCyber founder member of @IoT_SF is orga... @QonexCyber
22 Mon Jan 11 10:02:13 +0000 2016 <a href="http://dlvr.it" rel="nofollow">dlvr.i... Parties de bowling pour 4 à #POINCY : 35.00€ a... @keepmymindfree
23 Mon Jan 11 10:02:13 +0000 2016 <a href="http://dlvr.it" rel="nofollow">dlvr.i... #bonplan 30 séances de Squash à #LISSES : 39.9... @Bons_Plans_
24 Mon Jan 11 10:02:13 +0000 2016 <a href="http://twitter.com" rel="nofollow">Tw... App: ExZeus 2 – free to play https://t.co/ZT... @UniversalConsol
25 Mon Jan 11 10:02:13 +0000 2016 <a href="http://dlvr.it" rel="nofollow">dlvr.i... #discount Parties de bowling pour 4 Ã #POINCY ... @PromosPromos
26 Mon Jan 11 10:02:13 +0000 2016 <a href="http://www.google.com/" rel="nofollow... spiegel.de : Tier macht Sachen: Python beißt ... @arminfischer_de
27 Mon Jan 11 10:09:02 +0000 2016 <a href="http://twitter.com/download/android" ... 若いっていいねえ…ってよく言うw #Rの法則 @naco75x
28 Mon Jan 11 10:09:17 +0000 2016 <a href="http://twitter.com/download/iphone" r... 最近の若い子は最近使った\n#Rの法則 @K1224West
29 Mon Jan 11 10:09:19 +0000 2016 <a href="http://twitter.com/download/android" ... #Rの法則\n自分も若者なのに笑 @V6ZRRT7Q22BZ1cF
... ... ... ... ...
2221 Mon Jan 11 10:29:40 +0000 2016 <a href="http://201512291327-7430af.bitnamiapp... https://t.co/BTAAq6HuuJ - pcgamer - #machinele... @vinceyue
2222 Mon Jan 11 10:29:40 +0000 2016 <a href="http://twitter.com" rel="nofollow">Tw... RT @jose_garde: The Big Data Analytics Softwar... @LJ_Blanchard
2223 Mon Jan 11 10:29:41 +0000 2016 <a href="http://www.linkedin.com/" rel="nofoll... What is data mining? Do you have to be a mathe... @ednuwan
2224 Mon Jan 11 10:29:41 +0000 2016 <a href="http://twitter.com" rel="nofollow">Tw... RT @AgroKnow: A #BigData platform for the futu... @albertspijkers
2225 Mon Jan 11 10:29:42 +0000 2016 <a href="http://www.linkedin.com/" rel="nofoll... Big Data: Is It A Tsunami, The New Oil, Or Sim... @Summerlovegrove
2226 Mon Jan 11 10:29:43 +0000 2016 <a href="https://www.jobfindly.com/php-jobs.ht... Sr Software Engineer C Php Python Linux Jobs i... @jobfindlyphpdev
2227 Mon Jan 11 10:29:43 +0000 2016 <a href="http://twitter.com/download/iphone" r... RT @bigdataparis: #Bigdata bang : un marché en... @LifeIsWeb
2228 Mon Jan 11 10:29:43 +0000 2016 <a href="http://twitter.com" rel="nofollow">Tw... Learn from the best professors in India .. cou... @ashwaniapex
2229 Mon Jan 11 10:29:45 +0000 2016 <a href="http://getsmoup.com" rel="nofollow">S... RT @rebrandtoday: #startup or #rebrand -Buy Cr... @SmartData_Fr
2230 Mon Jan 11 10:29:45 +0000 2016 <a href="http://www.ajaymatharu.com/" rel="nof... ¿Cómo será el futuro del Big Data? https://t.c... @eduardogarsanch
2231 Mon Jan 11 10:29:46 +0000 2016 <a href="http://twitter.com" rel="nofollow">Tw... RT @jose_garde: 3 Ways to Transform Your Compa... @LJ_Blanchard
2232 Mon Jan 11 10:29:46 +0000 2016 <a href="http://publicize.wp.com/" rel="nofoll... Woman Tries To Kiss Python, Gets Bitten In The... @NAIJA_VIBEZ
2233 Mon Jan 11 10:29:46 +0000 2016 <a href="http://www.itknowingness.com" rel="no... RT @jose_garde: 3 Ways to Transform Your Compa... @itknowingness
2234 Mon Jan 11 10:29:48 +0000 2016 <a href="http://twitter.com/download/iphone" r... RT @ErikaPauwels: Building a #BigData platform... @impulsater
2235 Mon Jan 11 10:29:49 +0000 2016 <a href="http://twitter.com/download/android" ... En 2016 j'aimerais moins râler. #résolution. S... @ce1ce2makarenko
2236 Mon Jan 11 10:29:51 +0000 2016 <a href="http://twitter.com" rel="nofollow">Tw... RT @jose_garde: How Marketing Can Be Better Au... @LJ_Blanchard
2237 Mon Jan 11 10:29:51 +0000 2016 <a href="http://twitter.com" rel="nofollow">Tw... Hey @Pontifex accueille ces réfugiées dans ta ... @Atmosfive
2238 Mon Jan 11 10:29:51 +0000 2016 <a href="http://twitter.com" rel="nofollow">Tw... RT @Ubixr: .@GroupeLaPoste choisit le toulousa... @The_Nextwork
2239 Mon Jan 11 10:29:56 +0000 2016 <a href="http://twitterfeed.com" rel="nofollow... Thanks @hackplayers Blade: un webshell en Pyth... @Navarmedia
2240 Mon Jan 11 10:29:56 +0000 2016 <a href="http://twitter.com" rel="nofollow">Tw... RT @jose_garde: Which big data personality are... @LJ_Blanchard
2241 Mon Jan 11 10:29:56 +0000 2016 <a href="http://ifttt.com" rel="nofollow">IFTT... Cybersecurity Forum tackles challenges with th... @wulfsec
2242 Mon Jan 11 10:29:56 +0000 2016 <a href="http://publicize.wp.com/" rel="nofoll... Woman Tries To Kiss Python, Gets Bitten In The... @Lola2Records
2243 Mon Jan 11 10:29:56 +0000 2016 <a href="http://twitter.com" rel="nofollow">Tw... RT @TeamAnodot: Join David Drai, CEO of Anodot... @iottechexpo
2244 Mon Jan 11 10:29:57 +0000 2016 <a href="http://getsmoup.com" rel="nofollow">A... RT @rebrandtoday: #startup or #rebrand -Buy Cr... @AI__news
2245 Mon Jan 11 10:29:57 +0000 2016 <a href="http://www.twitter.com" rel="nofollow... RT @Matthis__VERNON: "@Fred_Poquet Sans #confi... @sibueta
2246 Mon Jan 11 10:29:57 +0000 2016 <a href="https://social.zoho.com" rel="nofollo... The right place for #BigData is #Cloud #Storag... @TyroneSystems
2247 Mon Jan 11 10:29:59 +0000 2016 <a href="http://twitter.com" rel="nofollow">Tw... RT @PEBlanrue: Il a toujours le mot pour rire,... @lesroisduring
2248 Mon Jan 11 10:29:58 +0000 2016 <a href="http://twitter.com" rel="nofollow">Tw... RT @DigitalAgendaEU: €15 million for a #IoT so... @ImproveNPA
2249 Mon Jan 11 10:29:59 +0000 2016 <a href="http://twitter.com" rel="nofollow">Tw... Neat IoT innovation https://t.co/atARX0m5Bj @sherwinnovator
2250 Mon Jan 11 10:30:00 +0000 2016 <a href="http://www.hubspot.com/" rel="nofollo... Check out our #Mobile App Predicitions for 201... @B60uk

2251 rows × 4 columns

Checking the most frequently used words


In [30]:
cv = CountVectorizer()
count_matrix = cv.fit_transform(dataset.text)

word_count = pd.DataFrame(cv.get_feature_names(), columns=["word"])
word_count["count"] = count_matrix.sum(axis=0).tolist()[0]
word_count = word_count.sort_values("count", ascending=False).reset_index(drop=True)
word_count[:50]


Out[30]:
word count
0 https 1986
1 co 1907
2 rt 804
3 de 550
4 rの法則 408
5 iot 374
6 the 358
7 bigdata 293
8 00 275
9 data 250
10 in 234
11 python 219
12 to 212
13 au 199
14 of 188
15 lieu 168
16 réduction 166
17 big 157
18 on 143
19 is 142
20 and 140
21 for 136
22 analytics 107
23 le 89
24 via 86
25 you 86
26 thingsexpo 86
27 by 85
28 2016 84
29 snake 80
30 en 76
31 bowie 75
32 la 74
33 thief 74
34 video 73
35 m2m 70
36 jose_garde 68
37 19 67
38 david 66
39 with 63
40 how 61
41 it 60
42 will 55
43 un 54
44 amp 53
45 des 53
46 réparation 53
47 new 52
48 39 52
49 at 51
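The top terms are dominated by URL fragments ("https", "co"), the retweet marker ("rt"), and common stop words. A quick way to reduce that noise is to strip URLs and retweet markers before vectorizing, and let CountVectorizer drop English stop words. A sketch (a simple heuristic, not a full tweet tokenizer):

```python
import re

def clean_text(text):
    # Strip URLs and the retweet marker before counting words
    text = re.sub(r"https?://\S+", "", text)
    text = re.sub(r"\bRT\b", "", text)
    return text

# Then, for example:
# cv = CountVectorizer(stop_words="english")
# count_matrix = cv.fit_transform(dataset.text.apply(clean_text))
```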

Visualization


In [37]:
def get_source_name(x):
    value = re.findall(pattern="<[^>]+>([^<]+)</a>", string=x)
    if len(value) > 0:
        return value[0]
    else:
        return ""

In [38]:
dataset["source_name"] = dataset.source.apply(get_source_name)  # bracket assignment creates a real column

source_counts = dataset.source_name.value_counts().sort_values()[-10:]

bottom = [index for index, item in enumerate(source_counts.index)]
plt.barh(bottom, width=source_counts, color="orange", linewidth=0)

y_labels = ["%s %.1f%%" % (item, 100.0*source_counts[item]/len(dataset)) for index,item in enumerate(source_counts.index)]
plt.yticks(np.array(bottom)+0.4, y_labels)

source_counts


Out[38]:
Facebook                25
TweetDeck               37
RoundTeam               41
Hootsuite               46
twitterfeed             81
IFTTT                  134
dlvr.it                200
Twitter Web Client     388
Twitter for Android    392
Twitter for iPhone     515
Name: source, dtype: int64